library(tidyverse)
library(readxl)
library(data.table)
library(corpora)
library(psych)
library(skimr)

Working with data sets
Data sets
We can import data in a number of ways. R generally prefers CSV files, but there are packages to read in other file formats (Excel, SPSS, JSON, etc.).
We cannot go into all of the details – but here is a whole course on the topic if you want to learn more: https://learn.datacamp.com/courses/importing-data-in-r-part-1
Reading in data
Those with some R experience probably already know read.table(), read.csv(), read.csv2() etc. (Note that, as of R version 4.0, the parameter stringsAsFactors is set to FALSE by default.)
Alternatively, you can use the readr functions (read_csv(), read_csv2(), etc.) to read in a data set as a tibble, which is a little faster. For really large files, the data.table package offers the function fread().
Which function is best suited to read in a specific file depends on the file format and the file’s formatting (field separator, decimal point, etc.). (Exception: fread(), which usually figures this out by itself.)
For CSV files in European format (semicolon as field separator, comma as decimal point), use read_csv2():
gen_blogs <- read_csv2("data/Genitive_DWDS_Blogs.csv")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 17512 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (1): Lemma
dbl (2): s.Genitiv, es.Genitiv
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We can now have a look at the data:
gen_blogs
You can also use str() to display its internal structure (or the structure of any R object, really) or glimpse() to get an overview of all the columns:
str(gen_blogs)
spc_tbl_ [17,512 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Lemma : chr [1:17512] "Leben" "Blog" "Internet" "Artikel" ...
$ s.Genitiv : num [1:17512] 3761 2570 1847 1757 1666 ...
$ es.Genitiv: num [1:17512] 0 0 0 0 0 6 192 0 0 265 ...
- attr(*, "spec")=
.. cols(
.. Lemma = col_character(),
.. s.Genitiv = col_double(),
.. es.Genitiv = col_double()
.. )
- attr(*, "problems")=<externalptr>
glimpse(gen_blogs)
Rows: 17,512
Columns: 3
$ Lemma <chr> "Leben", "Blog", "Internet", "Artikel", "Erachten", "Monat"…
$ s.Genitiv <dbl> 3761, 2570, 1847, 1757, 1666, 1562, 1479, 1463, 1260, 1241,…
$ es.Genitiv <dbl> 0, 0, 0, 0, 0, 6, 192, 0, 0, 265, 725, 0, 0, 74, 0, 0, 15, …
When you read in data, you can also specify data types for certain columns:
gen_blogs <- read_csv2("data/Genitive_DWDS_Blogs.csv",
col_types = "?ii")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
gen_blogs
The first argument is a file path. Since the folder “data” is located in my current working directory, I don’t need to specify the full/absolute path.
Alternatively, you can use file.choose() to select a file:
read_csv2(file.choose())
Try it out!
There’s also read_csv() for classic CSV files (comma as field separator, . as decimal point), read_tsv() for files with tab stops as field separators, and read_delim(), the parent function where you can specify everything yourself.
RStudio offers some options to read in files a little more comfortably: File -> Import Dataset
- From Text (base)…: base R functions (read.table() etc.): data.frame
- From Text (readr)…: tidyverse/readr style: tibble
- From Excel…: Excel files (tibble)
- From SPSS…: SPSS files (tibble)
- From SAS…: SAS files (tibble)
- From Stata…: Stata files (tibble)
Once you have selected suitable import options and clicked “Import”, the corresponding R command appears on the console. You can copy it to your script to speed up the process in the future.
Example: Opening an Excel file:
gen_blogs <- read_excel("data/Genitive_DWDS_Blogs.xlsx")
gen_blogs
So far, we’ve imported all data as tibbles. When you look at these, you can see each column’s data type directly (otherwise, use a function such as str() to display the structure of an object such as a data.frame).
The most common data type abbreviations are:
- chr for character
- fct for factor
- int for integer
- dbl for double
- lgl for logical
For a full list, see: https://tibble.tidyverse.org/articles/types.html.
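These abbreviations reflect the underlying R types, which you can check yourself with typeof() (and class() for factors) – a quick illustrative example:

```r
typeof("a")        # "character" (chr)
typeof(1L)         # "integer" (int)
typeof(1.5)        # "double" (dbl)
typeof(TRUE)       # "logical" (lgl)
class(factor("a")) # "factor" (fct)
```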
If you’ve got your own data with you, now’s the time to try to open it!
Otherwise, there are a few variations of the same data available:
- romane.tsv
- romane.csv
- romane2.csv
- romane3.csv
- romane4.csv
Can you read in all of them correctly?
romane <- read_tsv("data/romane.tsv",
col_types = cols(Genre = "f", Kategorie = "f"))
romane
Accessing parts of a data set
To access a column (usually a statistical variable), enter the data set’s name, followed by a dollar sign and the name of the column. We get a vector of values (we use head() so as not to display all of them):
gen_blogs$Lemma |> head()
[1] "Leben"    "Blog"     "Internet" "Artikel"  "Erachten" "Monat"
gen_blogs$s.Genitiv |> head()
[1] 3761 2570 1847 1757 1666 1562
The weird little operator |> here is called a pipe operator. It was introduced in R 4.1.0 after a similar operator (from the magrittr package), %>%, had already been commonly used in the Tidyverse for quite some time and had gained a lot of traction in the R community.
Both of these operators simply take the object to their left as the (first) input of the function to their right.
This generally makes code more readable – instead of using functions inside of functions inside of functions, just pipe the output to the next function etc.
The last line of code is therefore equivalent to:
head(gen_blogs$s.Genitiv)
[1] 3761 2570 1847 1757 1666 1562
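To see why piping helps readability, compare a nested call with its piped equivalent – a toy example with made-up values, not from the data set:

```r
# Nested calls are read from the inside out:
round(mean(sqrt(c(1, 4, 9))), 1)
# The pipeline reads from left to right:
c(1, 4, 9) |> sqrt() |> mean() |> round(1)
# Both return 2
```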
There are some subtle differences between |> and %>% (see e.g. here), but in most cases, they can be used interchangeably. The native pipe (|>) is a little faster, however.
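One such difference, sketched in base R: the native pipe requires an explicit function call on its right-hand side, and (since R 4.2) uses `_` as its placeholder, which must be bound to a named argument.

```r
# The native pipe needs an explicit call on its right-hand side:
c(1, 2, 3) |> sum()   # works; c(1, 2, 3) |> sum is an error with |> (but fine with %>%)

# Since R 4.2, `_` serves as a placeholder for a named argument:
10 |> seq(from = 2, to = _)   # same as seq(from = 2, to = 10)
```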
Pressing Ctrl+Shift+M (Mac: Cmd+Shift+M) in RStudio inserts a pipe operator (you can select your preferred one in the options).
Just as with vectors, you can use square brackets to subset a data set. You just have to provide two values: row and column.
romane[2, 3]       # second row, third column
romane[3, c(1, 3)] # third row, columns 1 and 3
romane[3, ]        # third row, all columns (don't forget the comma!)
romane[, 6]        # sixth column
To select certain columns, select() is also useful:
romane |> select(Genre, Autor, Titel)
romane |> select(Genre:Type_token_ratio)
romane |> select(Type_token_ratio:last_col())
You can also rename variables:
romane |> select(Titel,
TTR = Type_token_ratio,
Avg_length = Average_token_length_syllables)
If you just want to rename a column while keeping all other columns, rename() might be more practical:
romane |> rename(TTR = Type_token_ratio)
select() is also useful to change the order of columns:
romane |> select(Titel, Autor, everything()) # everything(): helper function
Filtering data sets
You’ll often want to get parts of a data set not according to their position, but according to certain conditions which must be fulfilled. That’s what filter() is for (or the base R function subset()).
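As a minimal base-R sketch on a made-up data frame (filter() works analogously on tibbles):

```r
df <- data.frame(x = 1:5, y = c(10, 3, 8, 1, 6))
subset(df, y >= 5)   # keeps rows 1, 3 and 5
# tidyverse equivalent: df |> filter(y >= 5)
```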
gen_blogs has 17512 rows – let’s keep just the lemmas that appear at least five times in either genitive form (an arbitrary choice):
gen_blogs <- gen_blogs |> filter(s.Genitiv >= 5 | es.Genitiv >= 5)
gen_blogs
If several conditions have to be fulfilled, they can be separated by commas:
gen_blogs |> filter(s.Genitiv >= 100, es.Genitiv >= 100)
Logical AND works the same way:
gen_blogs |> filter(s.Genitiv >= 100 & es.Genitiv >= 100)
There are some lemmas in gen_blogs that shouldn’t be in there.
Let’s throw them out by using %in%:
gen_blogs <- gen_blogs |>
filter(
!(Lemma %in% c("Äußer", "Inner", "Wichtiger", "Schlimmer",
"Besser", "Neu"))
  )
Try to …
- select all rows in gen_blogs where s.Genitiv is exactly 100
- select all rows in gen_blogs where es.Genitiv is between 100 and 200
- select all rows in romane where the genre is sci-fi, the type-token ratio is greater than 0.35 and Honoré’s H is greater than 2900
- select all rows in romane where the lexical density is smaller than 0.35 or greater than 0.45
- select all rows in romane where the author’s name starts with an “A” (tip: str_detect())
Modifying data
Sometimes, you want to modify certain columns. Luckily, not only can you call a column using the dollar sign notation, you can also assign new values this way.
Alternatively, you can use mutate(), a function that comes in especially handy if you want to modify several columns at once.
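A minimal sketch of both approaches on a made-up data frame; transform() is shown here as the base-R analogue of mutate():

```r
df <- data.frame(x = c(1, 2, 3))
df$y <- df$x * 2                 # dollar-sign assignment
df <- transform(df, z = x + 1)   # base analogue of df |> mutate(z = x + 1)
df
```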
In our romane data set, Rarity specifies the fraction of nouns, adjectives and verbs which can also be found in the most common 5000 nouns, adjectives and verbs in a reference corpus (in this case, the DECOW16BX). But this means that a lower value actually signifies a higher rarity (and thus, higher complexity) whereas the other measures in the data set work the other way around (higher value -> higher complexity). Luckily, this is very easy to fix:
romane$Rarity <- 1 - romane$Rarity
More cleanup of gen_blogs, using string functions and regular expressions:
Words ending in -nis have been improperly lemmatised (-niss):
str_subset(gen_blogs$Lemma, "niss$")
 [1] "Bündniss"                  "Ereigniss"
[3] "Verhältniss" "Ergebniss"
[5] "Verständniss" "Aktionsbündniss"
[7] "Gedächtniss" "Selbstverständniss"
[9] "Verzeichniss" "Wahlergebniss"
[11] "Gefängniss" "Bedürfniss"
[13] "Arbeitsverhältniss" "Bekenntniss"
[15] "Geheimniss" "Wahlgeheimniss"
[17] "Beschäftigungsverhältniss" "Geständniss"
[19] "Missverständniss" "Bankgeheimniss"
[21] "Kapitalverhältniss" "Erlebniss"
[23] "Unverständniss" "Briefgeheimniss"
[25] "Presseerzeugniss" "Fernmeldegeheimniss"
[27] "Vertragsverhältniss" "Einverständniss"
[29] "Gleichniss" "Inhaltsverzeichniss"
[31] "Mietverhältniss" "Arbeitsgedächtniss"
[33] "Begräbniss" "Jahrhundertereigniss"
[35] "Textverständniss" "Untersuchungsergebniss"
[37] "Verhängniss" "Ärgerniss"
gen_blogs$Lemma <- str_replace(gen_blogs$Lemma, "niss$", "nis")
There are also a very few lemmas with a non-alphabetic character at the end:
gen_blogs$Lemma <- str_replace(gen_blogs$Lemma, "[^[:alpha:]]$", "")
Adding columns
If you want to add a column to an existing data.frame, tibble or data.table, the vector needs to have the same length as the other columns.
There are quite a few ways to do this. The easiest one is probably this:
gen_blogs$Length <- str_length(gen_blogs$Lemma) # word length in characters
gen_blogs
Optional step: new column with the number of syllables
# install.packages("sylly")
# install.packages("sylly.de", repos = "https://undocumeantit.github.io/repos/l10n")
library(sylly.de)
gen_blogs$Syllables <- hyphen_c(gen_blogs$Lemma, hyph.pattern = "de",
quiet = TRUE)
mutate() can be used to add several columns at once, to change existing columns, and to do calculations with columns:
gen_blogs <- gen_blogs |>
mutate(Total = s.Genitiv + es.Genitiv,
Frac_es = round(es.Genitiv / Total, 2))
gen_blogs
Sorting
Use arrange() to change the order of rows:
gen_blogs |> arrange(desc(es.Genitiv))
desc() sorts in descending order.
You can also sort by several columns:
gen_blogs |> arrange(Length, Lemma)
gen_blogs |>
  arrange(desc(Length), desc(s.Genitiv), desc(es.Genitiv))
romane |> arrange(Kategorie, Genre, Autor, Titel)
“Long” and “wide” format
There are two different presentations for tabular data:
- In “wide” format, each row represents one unit of observation (e.g. a person, a country, a text or a word). There is no redundancy, making it easy to read. In case of repeated measures (e.g. the same statistical variable at different points in time or under different conditions), there are multiple columns.
- In “long” (or “narrow”) format, there are multiple rows for a single unit of observation, one for each condition or point in time. All measurements of the same statistical variable will be in a single column, while another column specifies the category (time, condition, …).
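A tiny made-up example showing the same measurements in both formats:

```r
# "Wide": one row per person, one column per time point
wide <- data.frame(person = c("A", "B"),
                   t1 = c(10, 12),
                   t2 = c(11, 14))
# "Long": one row per person-by-time combination
long <- data.frame(person = c("A", "A", "B", "B"),
                   time   = c("t1", "t2", "t1", "t2"),
                   value  = c(10, 11, 12, 14))
```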
Many functions in R require the input data to be in a specific format (mostly “long”), so you should know how to switch between the two.
As a first example, we’ll use frequency data of selected nouns in the written and spoken parts of the British National Corpus (BNC; see ?BNCcomparison):
BNCcomparison |> as_tibble()
Since both written and spoken contain frequencies, we could put these in a single column (frequency), with another column denoting the modality. To transform from “wide” to “long” format, we can use the pivot_longer() function:
BNC_long <- BNCcomparison |>
pivot_longer(cols = written:spoken,
names_to = "modality",
values_to = "frequency")
BNC_long
To transform back to “wide” format, use pivot_wider():
BNC_long |>
pivot_wider(names_from = "modality",
values_from = "frequency")
Can you do the same thing with gen_blogs?
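One possible sketch for the gen_blogs exercise, shown on a made-up two-row stand-in with the same column names (the labels Variant and Frequency are my choice):

```r
library(tidyr)
gen_mini <- data.frame(Lemma      = c("Leben", "Blog"),
                       s.Genitiv  = c(3761, 2570),
                       es.Genitiv = c(0, 0))
gen_mini |>
  pivot_longer(cols = c(s.Genitiv, es.Genitiv),
               names_to = "Variant", values_to = "Frequency")
```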
There are lots of examples in the vignette – have a look!
vignette("pivot")
The following more complex code shows an example using the romane data. We want to know which genres are more or less complex according to different measures of lexical complexity. Ideally, we’d have a single plot containing all measures by genre. We could use boxplots – but before we can do that, we have two problems to solve:
- Different measures are on very different scales, making visual comparison almost impossible when using the same y-axis.
- The plotting function we want to use requires the values of all measures to be in a single column (“long” format).
Let’s do some piping!
First, we write a little helper function to compute z-scores (we could also skip this step and use the in-built function scale() instead):
zscores <- function(x) {
(x - mean(x)) / sd(x)
}
Then, we use this function to mutate our data, before transforming it to long format (and making a factor out of the new Measure column):
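A quick sanity check of zscores() on an illustrative vector – it matches what the built-in scale() computes:

```r
zscores <- function(x) {
  (x - mean(x)) / sd(x)
}
zscores(c(2, 4, 6))                                       # -1  0  1
all(zscores(c(2, 4, 6)) == as.vector(scale(c(2, 4, 6))))  # TRUE
```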
romane_long <- romane |>
mutate(Type_token_ratio = zscores(Type_token_ratio),
Honore_H = zscores(Honore_H),
MTLD = zscores(MTLD),
Dispersion = zscores(Dispersion),
Disparity = zscores(Disparity),
Evenness = zscores(Evenness),
Density = zscores(Density),
Rarity = zscores(Rarity),
Average_token_length_syllables = zscores(Average_token_length_syllables)) |>
pivot_longer(cols = Type_token_ratio:Honore_H,
values_to = "Value", names_to = "Measure") |>
mutate(
Measure = factor(
Measure,
levels = c("Average_token_length_syllables",
"Type_token_ratio",
"Honore_H",
"MTLD",
"Disparity",
"Dispersion",
"Evenness",
"Density",
"Rarity"),
labels = c("Mean token length in syllables",
"Type-token ratio",
"Honoré's H",
"McCarthy and Jarvis' MTLD",
"Semantic Disparity",
"Dispersion",
"Evenness",
"Lexical density",
"Rarity")
)
)
romane_long
Finally, we can plot it:
romane_long |>
ggplot(aes(x = Measure, y = Value, colour = Genre)) +
geom_boxplot(outlier.alpha = .5) +
theme(axis.text.x = element_text(angle = -45, hjust = 0)) +
labs(y = "z-score", title = "Standardised measures by genre")
Summarising data
- group_by() creates a grouped tibble
- summarise() is then used for arbitrary operations (sums, means, standard deviations, …) which are performed by group
Typical descriptive statistics you may want to use:
- n(): current group size
- mean(): arithmetic mean
- mean(trim = .1): trimmed mean (fraction of trim removed from both the lowest and the highest values)
- median(): median
- var(): (sample) variance (with Bessel’s correction)
- sd(): (sample) standard deviation (with Bessel’s correction)
- min(), max(): lowest and highest value in a vector
- quantile(): sample quantiles (quartiles by default)
- IQR(): interquartile range (the difference between upper and lower quartile)
- Different packages offer functions for skew and kurtosis (though most of them actually mean excess, not kurtosis), e.g. psych::skew() and psych::kurtosi().
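Most of these are base-R functions you can try directly on an illustrative vector:

```r
x <- c(2, 4, 4, 5, 7, 9)
mean(x)      # 5.166667
median(x)    # 4.5
sd(x)
IQR(x)       # 2.5
quantile(x)  # quartiles by default
```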
gen_blogs |> group_by(Length) |>
summarise(Lemma_count = n(), s_genitives = sum(s.Genitiv),
es_genitives = sum(es.Genitiv))
gen_blogs |> group_by(Syllables) |>
summarise(Lemma_count = n(), s_genitives = sum(s.Genitiv),
es_genitives = sum(es.Genitiv))
Let’s see some summary statistics for one of the variables in romane:
romane |> group_by(Genre) |>
summarise(n = n(),
TTR_mean = mean(Type_token_ratio),
TTR_median = median(Type_token_ratio),
TTR_sd = sd(Type_token_ratio),
TTR_IQR = IQR(Type_token_ratio),
TTR_min = min(Type_token_ratio),
TTR_max = max(Type_token_ratio),
TTR_skew = skew(Type_token_ratio),
TTR_excess = kurtosi(Type_token_ratio))
Lots of packages offer convenient summary functions. Here are just a few examples:
summary(romane) # base R function
      ID               Genre              Kategorie  
Length:269 Hochliteratur :60 Hochliteratur : 60
Class :character Horror :51 Schemaliteratur:209
Mode :character Krimi :38
Liebesroman :60
Science-Fiction:60
Autor Titel Type_token_ratio Dispersion
Length:269 Length:269 Min. :0.2458 Min. :0.7879
Class :character Class :character 1st Qu.:0.2859 1st Qu.:0.8058
Mode :character Mode :character Median :0.3004 Median :0.8107
Mean :0.3045 Mean :0.8128
3rd Qu.:0.3189 3rd Qu.:0.8199
Max. :0.4199 Max. :0.8485
Disparity Evenness Density Rarity
Min. :0.5449 Min. :0.9144 Min. :0.3383 Min. :0.4805
1st Qu.:0.6062 1st Qu.:0.9309 1st Qu.:0.3879 1st Qu.:0.5893
Median :0.6283 Median :0.9338 Median :0.4173 Median :0.6201
Mean :0.6347 Mean :0.9339 Mean :0.4136 Mean :0.6193
3rd Qu.:0.6616 3rd Qu.:0.9370 3rd Qu.:0.4330 3rd Qu.:0.6475
Max. :0.8094 Max. :0.9472 Max. :0.4852 Max. :0.7340
Average_token_length_syllables MTLD Honore_H
Min. :1.544 Min. :107.0 Min. :2187
1st Qu.:1.630 1st Qu.:187.4 1st Qu.:2454
Median :1.667 Median :203.9 Median :2561
Mean :1.697 Mean :205.6 Mean :2629
3rd Qu.:1.752 3rd Qu.:221.0 3rd Qu.:2705
Max. :2.027 Max. :346.9 Max. :3725
psych::describe(romane) # psych; there's also describeBy() for groups
skim(romane) # skimr; works with group_by() and can be customised
| Name | romane |
| Number of rows | 269 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| factor | 2 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 12 | 12 | 0 | 269 | 0 |
| Autor | 0 | 1 | 9 | 23 | 0 | 115 | 0 |
| Titel | 0 | 1 | 5 | 94 | 0 | 269 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Genre | 0 | 1 | FALSE | 5 | Hoc: 60, Lie: 60, Sci: 60, Hor: 51 |
| Kategorie | 0 | 1 | FALSE | 2 | Sch: 209, Hoc: 60 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Type_token_ratio | 0 | 1 | 0.30 | 0.03 | 0.25 | 0.29 | 0.30 | 0.32 | 0.42 | ▃▇▃▁▁ |
| Dispersion | 0 | 1 | 0.81 | 0.01 | 0.79 | 0.81 | 0.81 | 0.82 | 0.85 | ▂▇▃▂▁ |
| Disparity | 0 | 1 | 0.63 | 0.04 | 0.54 | 0.61 | 0.63 | 0.66 | 0.81 | ▃▇▃▁▁ |
| Evenness | 0 | 1 | 0.93 | 0.00 | 0.91 | 0.93 | 0.93 | 0.94 | 0.95 | ▁▁▇▇▂ |
| Density | 0 | 1 | 0.41 | 0.03 | 0.34 | 0.39 | 0.42 | 0.43 | 0.49 | ▁▇▇▇▂ |
| Rarity | 0 | 1 | 0.62 | 0.05 | 0.48 | 0.59 | 0.62 | 0.65 | 0.73 | ▁▃▇▅▂ |
| Average_token_length_syllables | 0 | 1 | 1.70 | 0.09 | 1.54 | 1.63 | 1.67 | 1.75 | 2.03 | ▇▇▃▂▁ |
| MTLD | 0 | 1 | 205.59 | 34.75 | 107.01 | 187.36 | 203.88 | 220.99 | 346.92 | ▁▇▇▂▁ |
| Honore_H | 0 | 1 | 2628.62 | 263.70 | 2186.66 | 2453.95 | 2561.41 | 2705.48 | 3725.18 | ▇▇▂▁▁ |
romane |>
group_by(Genre) |>
  skim()
| Name | group_by(romane, Genre) |
| Number of rows | 269 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| factor | 1 |
| numeric | 9 |
| ________________________ | |
| Group variables | Genre |
Variable type: character
| skim_variable | Genre | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|---|
| ID | Hochliteratur | 0 | 1 | 12 | 12 | 0 | 60 | 0 |
| ID | Horror | 0 | 1 | 12 | 12 | 0 | 51 | 0 |
| ID | Krimi | 0 | 1 | 12 | 12 | 0 | 38 | 0 |
| ID | Liebesroman | 0 | 1 | 12 | 12 | 0 | 60 | 0 |
| ID | Science-Fiction | 0 | 1 | 12 | 12 | 0 | 60 | 0 |
| Autor | Hochliteratur | 0 | 1 | 9 | 23 | 0 | 58 | 0 |
| Autor | Horror | 0 | 1 | 13 | 18 | 0 | 4 | 0 |
| Autor | Krimi | 0 | 1 | 13 | 13 | 0 | 1 | 0 |
| Autor | Liebesroman | 0 | 1 | 11 | 18 | 0 | 42 | 0 |
| Autor | Science-Fiction | 0 | 1 | 10 | 21 | 0 | 10 | 0 |
| Titel | Hochliteratur | 0 | 1 | 5 | 94 | 0 | 60 | 0 |
| Titel | Horror | 0 | 1 | 9 | 36 | 0 | 51 | 0 |
| Titel | Krimi | 0 | 1 | 8 | 31 | 0 | 38 | 0 |
| Titel | Liebesroman | 0 | 1 | 17 | 41 | 0 | 60 | 0 |
| Titel | Science-Fiction | 0 | 1 | 11 | 30 | 0 | 60 | 0 |
Variable type: factor
| skim_variable | Genre | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|---|
| Kategorie | Hochliteratur | 0 | 1 | FALSE | 1 | Hoc: 60, Sch: 0 |
| Kategorie | Horror | 0 | 1 | FALSE | 1 | Sch: 51, Hoc: 0 |
| Kategorie | Krimi | 0 | 1 | FALSE | 1 | Sch: 38, Hoc: 0 |
| Kategorie | Liebesroman | 0 | 1 | FALSE | 1 | Sch: 60, Hoc: 0 |
| Kategorie | Science-Fiction | 0 | 1 | FALSE | 1 | Sch: 60, Hoc: 0 |
Variable type: numeric
| skim_variable | Genre | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Type_token_ratio | Hochliteratur | 0 | 1 | 0.32 | 0.04 | 0.25 | 0.30 | 0.33 | 0.35 | 0.42 | ▃▅▇▃▁ |
| Type_token_ratio | Horror | 0 | 1 | 0.29 | 0.02 | 0.27 | 0.28 | 0.30 | 0.30 | 0.35 | ▅▃▇▁▁ |
| Type_token_ratio | Krimi | 0 | 1 | 0.29 | 0.01 | 0.26 | 0.28 | 0.29 | 0.30 | 0.31 | ▂▁▇▇▅ |
| Type_token_ratio | Liebesroman | 0 | 1 | 0.29 | 0.02 | 0.25 | 0.28 | 0.29 | 0.30 | 0.33 | ▁▇▇▇▂ |
| Type_token_ratio | Science-Fiction | 0 | 1 | 0.32 | 0.03 | 0.27 | 0.30 | 0.31 | 0.34 | 0.38 | ▃▇▃▃▁ |
| Dispersion | Hochliteratur | 0 | 1 | 0.82 | 0.01 | 0.79 | 0.82 | 0.83 | 0.83 | 0.85 | ▂▂▇▇▃ |
| Dispersion | Horror | 0 | 1 | 0.81 | 0.01 | 0.79 | 0.81 | 0.81 | 0.81 | 0.83 | ▂▅▇▂▁ |
| Dispersion | Krimi | 0 | 1 | 0.81 | 0.01 | 0.79 | 0.81 | 0.81 | 0.81 | 0.82 | ▁▁▂▇▃ |
| Dispersion | Liebesroman | 0 | 1 | 0.80 | 0.01 | 0.79 | 0.80 | 0.81 | 0.81 | 0.82 | ▂▅▇▇▂ |
| Dispersion | Science-Fiction | 0 | 1 | 0.82 | 0.01 | 0.80 | 0.81 | 0.82 | 0.82 | 0.83 | ▃▇▅▇▂ |
| Disparity | Hochliteratur | 0 | 1 | 0.67 | 0.03 | 0.59 | 0.65 | 0.68 | 0.70 | 0.72 | ▂▁▇▇▇ |
| Disparity | Horror | 0 | 1 | 0.61 | 0.02 | 0.56 | 0.59 | 0.61 | 0.63 | 0.68 | ▅▇▇▁▁ |
| Disparity | Krimi | 0 | 1 | 0.63 | 0.03 | 0.58 | 0.62 | 0.64 | 0.65 | 0.68 | ▂▅▇▆▂ |
| Disparity | Liebesroman | 0 | 1 | 0.63 | 0.04 | 0.54 | 0.61 | 0.62 | 0.64 | 0.81 | ▂▇▂▁▁ |
| Disparity | Science-Fiction | 0 | 1 | 0.62 | 0.04 | 0.56 | 0.60 | 0.62 | 0.64 | 0.73 | ▃▇▆▃▁ |
| Evenness | Hochliteratur | 0 | 1 | 0.93 | 0.01 | 0.91 | 0.93 | 0.93 | 0.94 | 0.95 | ▂▃▇▇▂ |
| Evenness | Horror | 0 | 1 | 0.93 | 0.00 | 0.92 | 0.93 | 0.93 | 0.93 | 0.94 | ▁▃▇▃▁ |
| Evenness | Krimi | 0 | 1 | 0.93 | 0.00 | 0.93 | 0.93 | 0.93 | 0.94 | 0.94 | ▃▇▃▃▂ |
| Evenness | Liebesroman | 0 | 1 | 0.93 | 0.00 | 0.92 | 0.93 | 0.93 | 0.94 | 0.94 | ▁▃▇▇▁ |
| Evenness | Science-Fiction | 0 | 1 | 0.94 | 0.00 | 0.93 | 0.93 | 0.94 | 0.94 | 0.95 | ▅▇▆▆▂ |
| Density | Hochliteratur | 0 | 1 | 0.41 | 0.03 | 0.34 | 0.39 | 0.41 | 0.43 | 0.46 | ▂▃▇▆▅ |
| Density | Horror | 0 | 1 | 0.42 | 0.03 | 0.37 | 0.38 | 0.42 | 0.43 | 0.46 | ▆▁▃▇▃ |
| Density | Krimi | 0 | 1 | 0.41 | 0.02 | 0.38 | 0.40 | 0.41 | 0.43 | 0.44 | ▆▇▃▆▇ |
| Density | Liebesroman | 0 | 1 | 0.39 | 0.02 | 0.36 | 0.38 | 0.39 | 0.40 | 0.42 | ▃▇▅▆▃ |
| Density | Science-Fiction | 0 | 1 | 0.45 | 0.02 | 0.40 | 0.43 | 0.44 | 0.46 | 0.49 | ▂▇▇▅▂ |
| Rarity | Hochliteratur | 0 | 1 | 0.61 | 0.06 | 0.48 | 0.58 | 0.62 | 0.65 | 0.73 | ▂▅▇▅▃ |
| Rarity | Horror | 0 | 1 | 0.63 | 0.03 | 0.57 | 0.61 | 0.63 | 0.64 | 0.71 | ▅▆▇▂▁ |
| Rarity | Krimi | 0 | 1 | 0.59 | 0.03 | 0.51 | 0.59 | 0.60 | 0.62 | 0.65 | ▁▂▂▇▃ |
| Rarity | Liebesroman | 0 | 1 | 0.59 | 0.03 | 0.50 | 0.57 | 0.59 | 0.61 | 0.65 | ▁▂▇▇▂ |
| Rarity | Science-Fiction | 0 | 1 | 0.67 | 0.03 | 0.60 | 0.65 | 0.66 | 0.70 | 0.73 | ▁▆▇▃▅ |
| Average_token_length_syllables | Hochliteratur | 0 | 1 | 1.69 | 0.07 | 1.54 | 1.64 | 1.70 | 1.72 | 1.83 | ▂▇▇▅▂ |
| Average_token_length_syllables | Horror | 0 | 1 | 1.63 | 0.04 | 1.57 | 1.60 | 1.63 | 1.65 | 1.79 | ▅▇▂▁▁ |
| Average_token_length_syllables | Krimi | 0 | 1 | 1.64 | 0.04 | 1.59 | 1.61 | 1.64 | 1.65 | 1.76 | ▆▇▁▁▂ |
| Average_token_length_syllables | Liebesroman | 0 | 1 | 1.66 | 0.04 | 1.55 | 1.63 | 1.66 | 1.70 | 1.76 | ▁▇▆▇▂ |
| Average_token_length_syllables | Science-Fiction | 0 | 1 | 1.83 | 0.07 | 1.71 | 1.79 | 1.82 | 1.87 | 2.03 | ▃▇▅▂▁ |
| MTLD | Hochliteratur | 0 | 1 | 186.32 | 47.16 | 107.01 | 153.17 | 185.44 | 211.27 | 295.81 | ▅▅▇▂▂ |
| MTLD | Horror | 0 | 1 | 203.69 | 15.50 | 176.89 | 191.90 | 203.88 | 213.40 | 241.53 | ▆▆▇▅▁ |
| MTLD | Krimi | 0 | 1 | 199.56 | 18.50 | 168.85 | 186.00 | 194.60 | 210.57 | 254.07 | ▆▇▃▃▁ |
| MTLD | Liebesroman | 0 | 1 | 204.60 | 19.66 | 163.86 | 189.53 | 206.92 | 216.00 | 265.29 | ▃▆▇▂▁ |
| MTLD | Science-Fiction | 0 | 1 | 231.27 | 37.14 | 178.74 | 201.90 | 227.06 | 248.72 | 346.92 | ▇▇▂▂▁ |
| Honore_H | Hochliteratur | 0 | 1 | 2913.82 | 354.02 | 2241.32 | 2629.40 | 2903.46 | 3133.62 | 3725.18 | ▃▇▇▃▂ |
| Honore_H | Horror | 0 | 1 | 2546.50 | 128.74 | 2335.98 | 2442.36 | 2548.56 | 2610.74 | 3009.31 | ▆▇▅▁▁ |
| Honore_H | Krimi | 0 | 1 | 2449.41 | 73.96 | 2230.01 | 2420.46 | 2446.24 | 2486.39 | 2626.74 | ▁▁▇▃▂ |
| Honore_H | Liebesroman | 0 | 1 | 2505.85 | 130.14 | 2186.66 | 2430.71 | 2501.20 | 2573.43 | 2882.01 | ▁▇▇▃▁ |
| Honore_H | Science-Fiction | 0 | 1 | 2649.50 | 176.44 | 2403.01 | 2501.29 | 2603.43 | 2752.55 | 3071.16 | ▇▆▅▃▂ |
Does the lemma end in s, ß, z or x?
gen_blogs$Ends_in_s <- factor(
  ifelse(str_sub(gen_blogs$Lemma, start = -1) %in% c("s", "ß", "z", "x"),
         "yes", "no"))
gen_blogs
gen_blogs |> group_by(Ends_in_s) |>
  summarise(s = sum(s.Genitiv), es = sum(es.Genitiv))
Handling missing data
Missing data should always be NA in R. Pay special attention to character vectors – empty strings should often be NA as well.
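For example, recoding empty strings as NA in base R (made-up values):

```r
x <- c("yes", "", "no")
x[x == ""] <- NA
x   # "yes" NA "no"
```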
Many functions will return NA or throw an error when they encounter missing values. Luckily, many of them (e.g. mean(), sd()) have the optional argument na.rm that you can set to TRUE so they’ll ignore any missing data.
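A minimal illustration of na.rm:

```r
x <- c(1, 2, NA, 4)
mean(x)                # NA
mean(x, na.rm = TRUE)  # 2.333333
```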
The function na.omit() will drop missing values from a vector; it will drop all rows containing missing values from a matrix or data.frame. So think carefully before using it on whole data sets – you might throw away useful data. (The same goes for the Tidyverse function drop_na().)
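The different behaviour on vectors and data frames, again with made-up values:

```r
na.omit(c(1, NA, 3))   # 1 3 (plus an "na.action" attribute)
df <- data.frame(a = c(1, NA), b = c("x", "y"))
na.omit(df)            # drops the entire second row
```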
The package tidyr (from the Tidyverse) also provides further useful functions like replace_na() (say you want values of 0 instead of NA for specific columns) or fill().